Problem 1A:

Salary is hypothesized to depend on educational qualification and occupation. To understand the dependency, the salaries of 40 individuals [SalaryData.csv] are collected and each person’s educational qualification and occupation are noted. Educational qualification is at three levels, High school graduate, Bachelor, and Doctorate. Occupation is at four levels, Administrative and clerical, Sales, Professional or specialty, and Executive or managerial. A different number of observations are in each level of education – occupation combination.

[Assume that the data follows a normal distribution. In reality, the normality assumption may not always hold if the sample size is small.]

1a.1 State the null and the alternate hypothesis for conducting one-way ANOVA for both Education and Occupation individually.

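The hypotheses for the one-way ANOVA of Salary with respect to Education are:

H0: The mean salary is the same across all three education levels (salary does not depend on education).
Ha: The mean salary differs for at least one education level (salary depends on education).
Alpha = 0.05

The hypotheses for the one-way ANOVA of Salary with respect to Occupation are:

H0: The mean salary is the same across all four occupation levels (salary does not depend on occupation).
Ha: The mean salary differs for at least one occupation level (salary depends on occupation).
Alpha = 0.05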

1a.2. Perform a one-way ANOVA on Salary with respect to Education. State whether the null hypothesis is accepted or rejected based on the ANOVA results

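H0: The mean salary is the same across all education levels (salary does not depend on education).
Ha: The mean salary differs for at least one education level (salary depends on education).
Alpha = 0.05

The p-value is less than 0.05, so we reject the null hypothesis and conclude that the mean salary differs across education levels.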
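A one-way ANOVA of this kind tests the null hypothesis that all group means are equal. The sketch below shows the mechanics with scipy.stats.f_oneway. The salaries are synthetic values generated for illustration only (they are not the values in SalaryData.csv), and the group sizes are assumptions.

```python
import numpy as np
from scipy import stats

# Synthetic salaries for illustration only; NOT the values in SalaryData.csv.
rng = np.random.default_rng(0)
hs_grad = rng.normal(75_000, 20_000, size=9)
bachelors = rng.normal(165_000, 20_000, size=15)
doctorate = rng.normal(208_000, 20_000, size=16)

alpha = 0.05
# H0: the three group means are equal; Ha: at least one mean differs.
f_stat, p_value = stats.f_oneway(hs_grad, bachelors, doctorate)
reject_h0 = p_value < alpha

print(f"F = {f_stat:.2f}, p = {p_value:.3g}, reject H0: {reject_h0}")
```

The decision rule is the same one used in the answer: reject H0 when the p-value is below alpha, otherwise fail to reject it.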

1a.3. Perform a one-way ANOVA on Salary with respect to Occupation. State whether the null hypothesis is accepted or rejected based on the ANOVA results.

H0: The mean salary is the same across all occupation levels (salary does not depend on occupation).
Ha: The mean salary differs for at least one occupation level (salary depends on occupation).
Alpha = 0.05

The p-value is greater than 0.05, so we fail to reject the null hypothesis. There is not enough evidence to conclude that the mean salary differs across occupations.

1a.4. If the null hypothesis is rejected in either (2) or in (3), find out which class means are significantly different. Interpret the result.

The null hypothesis was rejected for Education in (2), because the p-value is smaller than the level of significance α = 0.05. The education means are therefore significantly different: at least one education-level mean salary differs substantially from the others. The ANOVA F-test by itself does not show which pairs of levels differ; a post-hoc pairwise comparison (for example Tukey's HSD) is needed to identify them.
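To find out which class means differ, one common option is Tukey's HSD test. The sketch below uses pairwise_tukeyhsd from statsmodels on synthetic data. The column names Education and Salary, the group sizes and the salary values are assumptions made for illustration, not the contents of SalaryData.csv.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Synthetic data for illustration only; column names are assumed.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "Education": ["HS-grad"] * 9 + ["Bachelors"] * 15 + ["Doctorate"] * 16,
    "Salary": np.concatenate([
        rng.normal(75_000, 20_000, 9),
        rng.normal(165_000, 20_000, 15),
        rng.normal(208_000, 20_000, 16),
    ]),
})

# One row per pair of education levels; the "reject" column flags pairs
# whose mean salaries differ at the 5% family-wise significance level.
tukey = pairwise_tukeyhsd(endog=df["Salary"], groups=df["Education"], alpha=0.05)
print(tukey.summary())
```

Each row of the summary compares two education levels, so the table shows directly which pairs drive the rejection of the overall null hypothesis.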

Problem 1B:

1b.1. What is the interaction between two treatments? Analyze the effects of one variable on the other (Education and Occupation) with the help of an interaction plot.

As per the interaction plot, Adm-clerical and Sales employees with Bachelors and Doctorate degrees earn almost the same salary.
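An interaction plot draws the mean salary of every Education and Occupation combination, with one line per education level. The table of cell means behind such a plot can be sketched with pandas as follows. The rows are a tiny hypothetical data set and the column names are assumptions.

```python
import pandas as pd

# Tiny hypothetical data set; values and column names are illustrative only.
df = pd.DataFrame({
    "Education": ["HS-grad", "HS-grad", "Bachelors", "Bachelors",
                  "Doctorate", "Doctorate", "Bachelors", "Doctorate"],
    "Occupation": ["Adm-clerical", "Sales", "Adm-clerical", "Sales",
                   "Adm-clerical", "Sales", "Sales", "Sales"],
    "Salary": [50, 60, 100, 110, 105, 115, 120, 125],
})

# Mean salary per (Occupation, Education) cell: rows are occupations,
# columns are education levels. Each column is one line in the plot.
cell_means = (
    df.groupby(["Occupation", "Education"])["Salary"].mean().unstack("Education")
)
print(cell_means)
```

Lines that stay roughly parallel across occupations suggest little interaction. Lines that cross or converge suggest that the effect of education depends on the occupation.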

1b.2. Perform a two-way ANOVA based on Salary with respect to both Education and Occupation (along with their interaction Education*Occupation). State the null and alternative hypotheses and state your results. How will you interpret this result?

H0: All mean values are equal (Education, Occupation and their interaction have no effect on the mean salary).
Ha: At least one mean value is not equal.

The p-values differ between the individual terms and the interaction term, and the two-way ANOVA results with and without the Education*Occupation interaction term are different. We reject the null hypothesis H0.

1b.3. Explain the business implications of performing ANOVA for this particular case study.

Performing ANOVA on this case study helps identify which factors affect salary. From the tests above, salary depends significantly on education, while occupation on its own did not show a significant effect in the one-way ANOVA. As per the interaction plot, we can suggest hiring HS-grad candidates for Adm-clerical and Sales roles, HS-grad and Bachelors candidates for Prof-specialty roles, and Bachelors and Doctorate candidates for Exec-managerial roles.

--------------------------------------------------------------------------------------------------------------

Problem 2:

The dataset Education - Post 12th Standard.csv contains information on various colleges. You are expected to do a Principal Component Analysis for this case study according to the instructions given. The data dictionary of the 'Education - Post 12th Standard.csv' can be found in the following file: Data Dictionary.xlsx.

2.1. Perform Exploratory Data Analysis [both univariate and multivariate analysis to be performed]. What insight do you draw from the EDA?

Basic Data Exploration

In this step, we perform the operations below to check what the data set comprises:

  1. head of the dataset
  2. shape of the dataset
  3. info of the dataset
  4. summary of the dataset

There are no duplicate rows, so no data needs to be removed.

Univariate Analysis

From the univariate plots, the following parameters are right-skewed: Apps, Accept, Enroll, Top10perc, F.Undergrad, P.Undergrad, Personal, perc.alumni and Expend. The following parameters are left-skewed: PhD and Terminal. The other parameters are approximately normally distributed.
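The direction of skew read off the plots can be cross-checked numerically with the sample skewness. A positive value indicates a right tail and a negative value a left tail. The sketch below uses synthetic columns with hypothetical names, not the college data set.

```python
import numpy as np
import pandas as pd

# Synthetic columns for illustration only; names are hypothetical.
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "right_skewed": rng.exponential(scale=1000, size=500),
    "left_skewed": 100 - rng.exponential(scale=10, size=500),
    "symmetric": rng.normal(50, 10, size=500),
})

# Sample skewness per column: > 0 right-skewed, < 0 left-skewed, near 0 symmetric.
skew = df.skew()
print(skew.round(2))
```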

Bivariate Analysis

Observation:

A considerable number of features are highly correlated:

  1. Number of applications received is highly correlated with Number of applications accepted and Number of new students enrolled.
  2. Number of new students enrolled is highly correlated with Number of full-time undergraduate students and Number of applications accepted.
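Highly correlated pairs like the ones listed above can be pulled out of a correlation matrix programmatically. The sketch below uses synthetic columns that only borrow names from the data set. The 0.8 cut-off is an assumption, not a value used in the original analysis.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only; Accept and Enroll are built from Apps.
rng = np.random.default_rng(4)
apps = rng.gamma(2.0, 1500.0, size=300)
df = pd.DataFrame({
    "Apps": apps,
    "Accept": 0.7 * apps + rng.normal(0, 200, 300),
    "Enroll": 0.3 * apps + rng.normal(0, 150, 300),
    "PhD": rng.normal(70, 15, 300),
})

corr = df.corr()

# Keep only the upper triangle so every pair appears once and the diagonal is dropped.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

threshold = 0.8  # assumed cut-off for "highly correlated"
high = pairs[pairs.abs() > threshold].sort_values(ascending=False)
print(high)
```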

2.2. Is scaling necessary for PCA in this case? Give justification and perform scaling.

Scaling is necessary for this case study. In the given data set, variables such as Apps, Accept and F.Undergrad take values in the hundreds and thousands, while Top10perc, Top25perc and PhD are only two-digit values. PCA is driven by variance, so without scaling the variables with the largest numeric ranges would dominate the principal components, and variables on such different scales are hard to compare.

After scaling, all variables are standardized and on one common scale.
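Standardization (z-score scaling) can be sketched with StandardScaler from scikit-learn. This is one possible scaler, not necessarily the one used in the original analysis. The two columns are synthetic and only borrow names from the data set.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic columns on very different scales; values are illustrative only.
rng = np.random.default_rng(5)
df = pd.DataFrame({
    "Apps": rng.gamma(2.0, 1500.0, size=200),
    "Top10perc": rng.uniform(1, 96, size=200),
})

# Subtract each column's mean and divide by its standard deviation.
scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)
print(scaled.describe().loc[["mean", "std"]].round(3))
```

After the transformation every column has mean 0 and unit standard deviation, so no variable dominates merely because of its units.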

2.3. Comment on the comparison between the covariance and the correlation matrices from this data [on scaled data].

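After scaling, the covariance and the correlation matrices are almost equal. This is expected: for standardized variables (mean 0, standard deviation 1), the covariance between two variables is their correlation. Any small remaining difference typically comes from the divisor used (n when scaling versus n - 1 when computing the sample covariance).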
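The relationship between the two matrices on scaled data can be checked numerically. The sketch below uses synthetic data and assumes the scaling divides by the population standard deviation (divisor n), while the covariance uses the sample divisor n - 1.

```python
import numpy as np

# Synthetic data with three columns on very different scales.
rng = np.random.default_rng(6)
X = rng.normal(size=(100, 3)) * np.array([1000.0, 10.0, 1.0])

# z-score scaling with the population standard deviation (divisor n).
Z = (X - X.mean(axis=0)) / X.std(axis=0)
n = len(Z)

cov_scaled = np.cov(Z, rowvar=False)    # sample covariance, divisor n - 1
corr = np.corrcoef(X, rowvar=False)     # correlation is unaffected by scaling

# The two matrices differ only by the constant factor n / (n - 1).
matches = np.allclose(cov_scaled, corr * n / (n - 1))
print(matches)
```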

2.4. Check the dataset for outliers before and after scaling. What insight do you derive here? [Please do not treat Outliers unless specifically asked to do so]

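As per the box plots, the outliers are still present after scaling. Standardization is a linear transformation, so it changes the range of the values but not which points fall outside the whiskers. The difference is that in the scaled data all variables, including their outliers, lie on one common scale and can be compared, unlike in the unscaled data.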
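The effect of scaling on outliers can be checked with the 1.5 * IQR rule that box plots use for their whiskers. The sketch below uses one synthetic right-skewed variable. The helper iqr_outlier_mask is a hypothetical name introduced here for illustration.

```python
import numpy as np

def iqr_outlier_mask(x):
    """Flag points outside the box-plot whiskers (1.5 * IQR rule)."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

# Synthetic right-skewed variable for illustration only.
rng = np.random.default_rng(7)
x = rng.gamma(2.0, 1500.0, size=300)
z = (x - x.mean()) / x.std()  # z-score scaling

before = iqr_outlier_mask(x)
after = iqr_outlier_mask(z)
print(before.sum(), after.sum())
```

The same observations are flagged before and after scaling. Only the units on the axis change.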

2.5. Extract the eigenvalues and eigenvectors.[Using Sklearn PCA Print Both]
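One possible way to obtain both quantities with scikit-learn is sketched below on synthetic data. After fitting, explained_variance_ holds the eigenvalues of the covariance matrix of the scaled data and components_ holds the eigenvectors, one per row.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic correlated data for illustration only (200 rows, 4 features).
rng = np.random.default_rng(8)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
Z = StandardScaler().fit_transform(X)

pca = PCA().fit(Z)
eigenvalues = pca.explained_variance_   # sorted from largest to smallest
eigenvectors = pca.components_          # row i is the eigenvector of PC(i+1)

print("Eigenvalues:\n", eigenvalues)
print("Eigenvectors:\n", eigenvectors)
```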

2.6. Perform PCA and export the data of the Principal Component (eigenvectors) into a data frame with the original features
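Exporting the eigenvectors into a data frame labelled with the original features can be sketched as follows. The data are synthetic and the four feature names are only borrowed from the data set for readability.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic standardized data for illustration only.
rng = np.random.default_rng(9)
features = ["Apps", "Accept", "Enroll", "Top10perc"]
Z = pd.DataFrame(rng.normal(size=(150, 4)), columns=features)
Z = (Z - Z.mean()) / Z.std(ddof=0)

pca = PCA().fit(Z)

# Rows are principal components, columns are the original features.
loadings = pd.DataFrame(
    pca.components_,
    columns=features,
    index=[f"PC{i + 1}" for i in range(pca.n_components_)],
)
print(loadings.round(2))
```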

2.7. Write down the explicit form of the first PC (in terms of the eigenvectors. Use values with two places of decimals only). [hint: write the linear equation of PC in terms of eigenvectors and corresponding features]
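The explicit form of the first PC is a linear combination of the scaled features, weighted by the entries of the first eigenvector. The sketch below builds that equation as text with two decimal places. It uses synthetic data, so the coefficients it prints are not those of the college data set.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic standardized data for illustration only.
rng = np.random.default_rng(10)
features = ["Apps", "Accept", "Enroll"]
Z = rng.normal(size=(100, 3))
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)

pca = PCA().fit(Z)
w = pca.components_[0]  # first eigenvector

# PC1 = w1 * feature1 + w2 * feature2 + ... on the scaled features.
equation = "PC1 = " + " + ".join(
    f"({c:.2f} * {name})" for c, name in zip(w, features)
)
print(equation)

# The same linear combination reproduces the PC1 scores from scikit-learn.
scores = Z @ w
same_scores = np.allclose(scores, pca.transform(Z)[:, 0])
```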

2.8. Consider the cumulative values of the eigenvalues. How does it help you to decide on the optimum number of principal components? What do the eigenvectors indicate?

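The cumulative sum of the eigenvalues, taken as a share of the total variance, shows how much information the first k PCs retain. This lets us pick the smallest number of components that reaches the desired level: here 9 PCs retain ~92% of the information. The eigenvectors indicate the direction of each PC, that is, the weight (loading) of each original feature in that component.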
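Choosing the number of components from the cumulative eigenvalues can be sketched as below. The eigenvalues and the 90% threshold are hypothetical numbers chosen to keep the arithmetic easy to follow. They are not the values from this analysis.

```python
import numpy as np

# Hypothetical eigenvalues, sorted from largest to smallest.
eigenvalues = np.array([4.0, 2.0, 1.0, 0.5, 0.3, 0.2])

# Share of total variance per PC and its running total.
explained_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)

# Smallest number of PCs whose cumulative share reaches the threshold.
threshold = 0.90
n_components = int(np.searchsorted(cumulative, threshold) + 1)

print(cumulative.round(4))
print(n_components)  # 4, since the first four PCs reach 0.9375
```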

2.9 Explain the business implication of using the Principal Component Analysis for this case study. How may PCs help in the further analysis? [Hint: Write Interpretations of the Principal Components Obtained]

The business implication of using PCA in this study is that it reduces the dimensionality of the data set and increases interpretability while minimizing information loss. As per the cumulative explained variance, 9 of the 17 PCs retain ~92% of the information. In this case, the dimensionality is reduced by almost 50% (from 17 to 9) at the cost of only ~8% of the information.